Hypertuned Deep Convolutional Neural Network for Sign Language Recognition
Computational Intelligence and Neuroscience

Sign language plays a pivotal role in the lives of people with speech and hearing disabilities, who convey messages through hand gestures. American Sign Language (ASL) recognition is challenging because of high inter-class similarity and high complexity. To address these challenges, this paper presents an ASL alphabet recognition approach based on a deep convolutional neural network (DeepCNN). Because the performance of the DeepCNN model improves with the amount of available data, we apply data augmentation to artificially expand the training set from the existing data. According to the experiments, the proposed DeepCNN model provides consistent results on the ASL dataset. The experiments show that the DeepCNN achieves accuracy gains of 19.84%, 8.37%, 16.31%, 17.17%, 5.86%, and 3.26% over six state-of-the-art approaches.


Introduction
Sign language alphabets (SLAs) are formed through facial and hand gestures. Because most hearing people do not understand sign language, people with speech and hearing disabilities often struggle to express their feelings and thoughts to others. Hand gestures have been part of human communication since the inception of the human race. Gesture recognition is widely used in the medical domain and in sign language interpretation [1]. Sign language is used by nearly 2,500,000 people around the world.
Various approaches have been developed to help people with speech and hearing disabilities communicate. It is not easy for them to find help or a translator for their daily activities. A novel approach can overcome this problem and enable communication between people with and without such disabilities. There are about 100 different sign languages, which are studied for purposes such as classification and understanding the thoughts of people with disabilities. Some of these are American Sign Language (ASL), Indian Sign Language (ISL), and Italian Sign Language. In India, sign language is the primary mode of communication for millions of people. ASL is one of the most widely used languages for sign language alphabet recognition [2]. More than 30 nations use ASL, and a million people in the USA use it as their mode of communication.
ASL is a complex and widely used language, formed with fingers, actions, and hand and facial gestures to convey the thoughts of people with disabilities. It also spreads happiness and hope among them [3,4]. ASL includes 26 gestures known as the American Manual Alphabet, which are used to spell the words of the English dictionary. The 26 ASL alphabets are formed with 19 different handshapes. Because there are fewer handshapes than letters, some handshapes express different alphabets when the position of the hand changes, as with the letters "P" and "K." Most of the hand gestures are also used to represent the numbers from "0" to "9," but no hand gesture is dedicated to specific terms or nouns [5,6]. Different hand and facial gestures are used to present various English words. Figure 1 shows the gesture sign for every letter of the English alphabet from A to Z. Gesture recognition is divided into two types: static and dynamic [7]. Static gesture recognition is a pattern recognition problem [8][9][10], in which feature extraction is part of the preprocessing step [11,12]. Feature extraction is an essential step in every conventional pattern recognition task. Static gestures require only a single image as input to the classifier and therefore incur a lower computational cost.
On the other hand, dynamic gesture recognition is one of the most challenging tasks in computer vision [13]. It requires a sequence of images, and gestures are recognized based on the features extracted by a feature extraction algorithm [14][15][16][17]. Deaf people mainly focus on learning the hand gestures for alphabets and digits to interact with others; hence, this study presents a precise analysis across different classes to identify the correct hand gesture letters of ASL [18]. Twenty-four different gestures of the ASL MNIST dataset were used for classification, some of which have significant inter-class similarities. Deep neural networks are commonly used for ASL recognition, and in this work a DeepCNN-based algorithm is used for ASL alphabet recognition.
The main contributions of this work are as follows:
(i) We propose a deep learning-based DeepCNN algorithm to recognize 24 alphabets from ASL data.
(ii) We expand the data size using data augmentation for better training and use the trained model for prediction.
(iii) We evaluate the performance of the proposed approach using recognition accuracy, and it outperforms the existing state-of-the-art approaches with a highest gain of 19.84%.
The rest of this article is organized as follows. Section 2 contains an overview of the relevant literature. Section 3 gives a full description of the dataset. Section 4 explains the proposed technique. Section 5 presents the experimental results and a comparison with the baselines. Finally, Section 6 concludes this study and outlines future work directions.

Related Work
Various techniques have been used to solve the problem of sign language gesture recognition [18]. Many previous works have used SVM to classify gestures in ASL [19]. The hidden Markov model (HMM) and SVM have also been used for ASL recognition, classifying sign language alphabets with a success rate of 86.67%. Furthermore, several studies [20][21][22] have addressed the recognition of dynamic hand gestures. Identifying dynamic hand gestures is challenging, and researchers have been working on it over the last decade. In addition, the same sign can appear different when performed by different people. The authors in [23] proposed a deep learning-based approach for the classification of ASL. They used a self-generated dataset for sign language recognition and achieved a classification accuracy of 82.5%.
Several methods based on motion gloves, image processing, and leap motion controllers have been used for ASL recognition. The authors in [24] proposed an ANN-based model to identify 3D motion based on 50 ASL words. This approach is time-consuming and computationally expensive [25][26][27]. Many researchers have developed approaches for ASL recognition, but it remains a challenging task because of intra-class variations, sign complexity, and high inter-class similarity [28,29]. The authors in [30] proposed an ASL recognition system based on a 3D motion sensor. They used K-nearest neighbor (KNN) and support vector machine (SVM) to classify the 26 English alphabets, with five palm and four-finger features derived from the sensory data. The KNN model achieved 72.78% accuracy, while the SVM model achieved 79.83%. ASL gesture recognition in real life is a challenging task because it requires robustness, efficiency, and accuracy. The authors in [31] presented an effective hand gesture recognition system named LMC to obtain multiple types of information. They used the proposed system to identify several fingers, fingertips, and hand positions, and they further used these gestures for sign language recognition. They used the SVM model as the classification algorithm. The classifier determines the class with the highest confidence, and this class is then assigned to the hand gesture. The 28 static hand gestures from arce are used, and the proposed SVM algorithm is used to recognize them. This approach successfully recognized 28 hand gestures and the digits 0-9 with an accuracy rate of 91%.
Furthermore, Chong and Lee [32] presented an approach for ASL recognition. They used 26 sign language alphabets and ten digits with the leap motion controller. The features are divided into six sets of combinations with 23 features. The findings indicate that the distance between two adjacent fingertips is significant. They used a DNN-based algorithm for sign language recognition. The proposed DNN algorithm performed well on both the 26-class and 36-class ASL datasets but did not perform well on digits because of the high inter-class similarity between letters and digits. Compared to all of the works mentioned above, the proposed approach is very efficient for ASL recognition. We used 24 ASL alphabets and a fine-tuned deep CNN algorithm for sign language recognition, which provided better performance than all of these works.

Dataset Description
To evaluate the overall performance of the proposed approach, we perform experiments on a widely used, publicly available sign language dataset in the style of the Modified National Institute of Standards and Technology (MNIST) database that consists of hand gestures for ASL alphabetic letters. Using the Sign Language MNIST dataset from Kaggle (https://www.kaggle.com/datamunge/sign-language-mnist), we evaluated models that classify the hand sign for each letter of the alphabet. Because the letters J and Z involve motion, they are excluded from the dataset. The data nonetheless comprise roughly 35,000 images of 28 × 28 pixels for the remaining 24 letters. Like the original MNIST images, each image contains grayscale values for its 784 pixels. The dataset is divided into training, validation, and testing sets. The training and testing sets carry labels (0-25) representing each letter from A to Z, except that there are no cases for 9 = J and 25 = Z because of their gesture motions. The number of training samples for each label is presented in Figure 2.
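The label convention described above can be sketched as follows. This is a minimal illustration, assuming the labels index the letters A to Z in alphabetical order; the helper name `label_to_letter` is hypothetical and not part of the dataset.

```python
import string

# Labels 9 (J) and 25 (Z) never occur because those signs involve motion.
EXCLUDED_LABELS = {9, 25}


def label_to_letter(label: int) -> str:
    """Map a Sign Language MNIST label (0-25) to its letter."""
    if not 0 <= label <= 25:
        raise ValueError(f"label out of range: {label}")
    if label in EXCLUDED_LABELS:
        raise ValueError(f"label {label} is not present in the dataset")
    return string.ascii_uppercase[label]


# The 24 letters that remain once J and Z are excluded.
VALID_LETTERS = [label_to_letter(i) for i in range(26) if i not in EXCLUDED_LABELS]
```

Keeping the original 0-25 numbering, rather than renumbering to 0-23, means the highest label that actually occurs is 24 (Y).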
Initially, we had 27,455 training cases and 7172 test cases. In this study, we further divided the original training set into a new training set of 24,710 cases and a validation set of 2745 cases, while the test set retains its 7172 cases. Each case is a row of attributes from pixel1, pixel2 up to pixel784, representing a 28 × 28-pixel image with grayscale values in the range 0-255. An example from Sign Language MNIST is shown in Figure 3.
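The split sizes above correspond to holding out 10% of the original training set for validation. The following is a minimal sketch, assuming a random shuffle with a fixed seed; the paper does not state how cases were assigned to each subset, and the function name `split_indices` is hypothetical.

```python
import random

N_ORIGINAL_TRAIN = 27_455   # training cases in the original dataset
VALIDATION_FRACTION = 0.10  # assumed hold-out fraction


def split_indices(n_cases: int, val_fraction: float, seed: int = 0):
    """Shuffle case indices and split them into training and validation sets."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    n_val = int(n_cases * val_fraction)  # 10% of 27,455, rounded down
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(N_ORIGINAL_TRAIN, VALIDATION_FRACTION)
print(len(train_idx), len(val_idx))  # prints: 24710 2745
```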

Proposed System
We propose a CNN-based architecture for sign language alphabet recognition, which proves very effective for this task.
The convolutional layers of the CNN model obtain feature maps by convolving the input with kernels of different sizes. This operation is defined in equation (1), where $l$ denotes the layer number, $i$ indexes the input maps in $M_j$, $j$ indexes the output maps, $*$ represents the convolution operation, and $a_j^l$ represents the output features computed from the input maps $M_j$. The kernel at the $l$-th layer of the CNN is represented by $k_{ij}^l$, the bias factor is represented by $b_j^l$, and $f$ is the activation function. Max-pooling is part of the subsampling layer, which computes the mean or max value over the features divided into different regions. The subsampling layer is defined in equation (2), where $g$ denotes the subsampling function and $R_j$ represents the subsampling region.
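Equations (1) and (2) can be written in a standard form that uses the symbols named above. This is a sketch under assumptions: $l$ is taken to be the layer index, $i$ the index over the input maps in $M_j$ (or over the positions in the region $R_j$), and the inputs of layer $l$ are taken to be the outputs $a_i^{l-1}$ of the previous layer. These choices follow the common textbook formulation and may differ from the authors' exact notation.

```latex
% (1) Convolutional layer: output map j of layer l
a_j^{l} = f\left( \sum_{i \in M_j} a_i^{l-1} * k_{ij}^{l} + b_j^{l} \right)

% (2) Subsampling layer: g pools (max or mean) over the region R_j
a_j^{l} = g\left( a_i^{l-1} \right), \quad \forall\, i \in R_j
```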
Figure 4 shows the proposed architecture of the DeepCNN model for ASL recognition. The proposed fine-tuned CNN architecture contains multiple convolutional, max-pooling, dropout, and dense (fully connected) layers. The input data are augmented by adding perturbed versions of the existing images, produced by operations such as scaling and rotation.
This augmentation exposes the neural network to a wider range of variations, which makes it less likely to learn unwanted characteristics of the dataset. The architecture has three main blocks with different parameter settings. The first block has 32 filters with a 3 × 3 kernel size and uses ReLU as the activation function. It is followed by a 2 × 2 max-pooling layer with half padding, which progressively reduces the spatial size of the representation and thereby the number of parameters and the amount of computation in the model. The second block has 128 filters with a 3 × 3 kernel size and the ReLU activation function, followed by a 2 × 2 max-pooling layer with half padding. The third block has 512 filters with a 3 × 3 kernel size and the ReLU activation function, again followed by a 2 × 2 max-pooling layer with half padding. Through these three blocks, the model learns the features. These features are flattened by a flatten layer, which converts the data to a vector before it is passed to a group of fully connected layers. Two dense layers with 1024 and 256 units, respectively, use the ReLU activation function and are followed by a dropout layer with a rate of 0.5 to control overfitting. Finally, a 25-unit dense (fully connected) layer is used as the output layer, with a softmax function to predict the gesture of the ASL alphabet.
Figure 5 shows an overview of the proposed approach for ASL recognition. In the first stage, the input images are split into training, validation, and testing data, and the training data are then augmented. We used an image data generator to expand the size of the training dataset by creating modified versions of its images. The augmented data are used to train the fine-tuned CNN model. In the second stage, features are extracted by passing the data through the three blocks shown in Figure 4.
In the next stage, these features are used to classify the ASL alphabets by applying the softmax activation function. The final stage is prediction on unseen test data, which we use to test the model's capability for ASL recognition. Finally, the data are classified, and the predicted output is obtained. Figure 6 shows the parameters and output shape of each layer used in the CNN architecture. The total number of trainable parameters is 2,994,649. The proposed CNN model learns the hand gestures in the training stage, during which it is allowed to examine all the pixels in the images. In the testing phase, we use an unseen hand gesture dataset. If any pixel has the hand gesture $H_i$, the output layer node $O_i$ returns the maximum response, and the model returns the "1" or "on" state. Suppose there are $P_j$ pixels $P_1, P_2, P_3, \ldots, P_{P_j}$ in the image $I_j$ (where $j = 1, 2, 3, \ldots, 9000$). When a pixel $P_k$ is passed to the CNN, it will return the output as 24. Algorithm 1 shows the testing phase of the proposed approach.
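The reported parameter total can be checked by hand from the layer sizes. The sketch below counts weights and biases layer by layer. It assumes that the convolutions are unpadded with stride 1 and that "half padding" in the pooling layers rounds odd sizes up, as Keras-style "same" padding does. These settings are not stated explicitly in the text, but under them the count matches the reported total. The function names are hypothetical.

```python
import math


def conv_out(size: int, kernel: int) -> int:
    """Output size of an unpadded convolution with stride 1 (assumed)."""
    return size - kernel + 1


def pool_out(size: int, pool: int) -> int:
    """Output size of pooling with stride = pool and 'same'-style padding (assumed)."""
    return math.ceil(size / pool)


def count_parameters(input_size=28, channels=1, filters=(32, 128, 512),
                     dense_units=(1024, 256, 25), kernel=3, pool=2) -> int:
    """Count trainable parameters of the three-block CNN plus dense layers."""
    total, size = 0, input_size
    for n_filters in filters:
        total += (kernel * kernel * channels + 1) * n_filters  # weights + biases
        size = pool_out(conv_out(size, kernel), pool)
        channels = n_filters
    features = size * size * channels  # length of the flattened vector
    for units in dense_units:
        total += (features + 1) * units  # weights + biases
        features = units
    return total


print(count_parameters())  # prints: 2994649
```

Under these assumptions the spatial size shrinks from 28 to 13, 6, and 2 across the three blocks, so the flatten layer produces a vector of 2 × 2 × 512 = 2048 values. The flatten and dropout layers add no trainable parameters.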

Experimental Results and Analysis
The primary goal of this research is to assess the classification performance of the proposed CNN classifier using several evaluation metrics. Precision, recall, and F1-score are the evaluation measures used in this study. For experimentation, we divided the original training data into two parts: 90% for training and 10% for validation. Overall, about 70% of the data are used for training and about 20% for testing.

Results.
In this research work, we employed the Sign Language MNIST dataset, which contains hand gestures for ASL alphabetic letters, and analyzed the performance using the given evaluation metrics (accuracy, precision, recall, and F1-score). We employed the proposed CNN to classify the data for recognizing sign language alphabets. The Sign Language MNIST dataset has 24 classes (excluding J and Z). The testing data, with 7172 images, are provided separately. The Python, Keras, and TensorFlow libraries were used for the analyses. The experiments use 34,627 images in total, and the model is trained on an NVIDIA GTX 1060 GPU. The classification results of the proposed CNN model for sign language recognition are shown in Table 1. We repeated the experiments about six times with different learning rates. In the first run, a learning rate of 0.00075 yielded a very low training accuracy of 23.37% and a validation accuracy of 79.09%; after we changed the learning rate to 0.00050, the training and validation accuracy increased to 98.69% and 98.83%, respectively. During training, the validation accuracy is measured at each epoch, and if the validation loss does not change between two epochs, learning rate reduction decreases the learning rate automatically. Figure 8 demonstrates that the trained model performed well on unseen data and correctly predicted all classes. On unseen data, we additionally calculate the accuracy, recall, and F1-score, each of which is 99% on the test data.
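The automatic learning rate reduction described above can be illustrated with a small scheduler. This is a sketch rather than the authors' implementation: the decay factor of 0.5 and the lower bound are assumptions, and the class name `PlateauLearningRate` is hypothetical. Keras offers a callback with similar behavior, `ReduceLROnPlateau`, although the text does not say whether it was used.

```python
class PlateauLearningRate:
    """Reduce the learning rate when the validation loss stops improving."""

    def __init__(self, initial_lr: float, factor: float = 0.5,
                 patience: int = 2, min_lr: float = 1e-5):
        self.lr = initial_lr
        self.factor = factor      # assumed multiplicative decay
        self.patience = patience  # epochs to wait without improvement
        self.min_lr = min_lr      # assumed lower bound
        self.best_loss = float("inf")
        self.stale_epochs = 0

    def step(self, val_loss: float) -> float:
        """Record one epoch's validation loss and return the learning rate."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.stale_epochs = 0
        return self.lr


# Made-up validation losses: the loss stalls at 0.40 for two epochs.
scheduler = PlateauLearningRate(initial_lr=0.00050)
history = [scheduler.step(loss) for loss in [0.90, 0.40, 0.40, 0.40, 0.35]]
```

In this toy run the learning rate stays at 0.00050 for the first three epochs and is halved at the fourth, once the validation loss has failed to improve for two consecutive epochs.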
Finally, per-class confusion matrices for the unseen data were generated, as shown in Figure 9. Each is expressed in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) [33]. To test the model's capabilities, we calculated accuracy, sensitivity, and specificity for each alphabet from these values, as shown in Figure 10. The per-class precision, recall, and F1-score are also calculated. A score above 97% implies that the model has a high probability of correctly identifying negative results in each of the 24 classes and that the proportion of correctly identified classes is also high.
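The per-class measures above follow directly from the TP, TN, FP, and FN counts of a multi-class confusion matrix. The sketch below uses the standard one-vs-rest definitions. The function name `per_class_metrics` is hypothetical, and the toy matrix contains made-up counts rather than the paper's results.

```python
def per_class_metrics(confusion):
    """Compute accuracy, sensitivity, and specificity for each class.

    confusion[t][p] counts samples of true class t predicted as class p.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    metrics = []
    for c in range(n_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n_classes)) - tp
        tn = total - tp - fn - fp
        metrics.append({
            "accuracy": (tp + tn) / total,
            "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
            "specificity": tn / (tn + fp) if tn + fp else 0.0,
        })
    return metrics


# Toy 3-class confusion matrix (made-up counts, not the paper's results).
toy = [[8, 1, 1],
       [0, 9, 1],
       [0, 0, 10]]
results = per_class_metrics(toy)
```

Sensitivity is the share of a class's true samples that are recognized, while specificity is the share of all other samples that are correctly not assigned to that class.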

Comparative Analysis with Baseline Approach.
We compare the results of the proposed approach with those of different state-of-the-art studies. The proposed approach performed very well compared to all the baseline approaches. Table 2 provides an overview of the comparative analysis of this study with multiple baseline approaches. Among the baselines, the method in [30] achieved 79.83% accuracy using the SVM model on a dataset of 26 ASL gestures. The study in [31] also used the SVM approach, on a ten-digit ASL gesture dataset, and achieved an accuracy of 91.30%. The study in [34] used ten selected gestures for experimentation and obtained an accuracy of 83.36% using the SVM model.
Another study [23] used a deep CNN to classify a dataset of 24 ASL gestures and attained an accuracy of 82.5%. Furthermore, the study in [32] used datasets of 26 ASL gestures (A-Z) and 36 ASL gestures (A-Z, 0-9) for experimentation with the DNN approach and obtained an accuracy of 93.81%. Finally, we compare our results with the work in [32], which used 30 ASL gestures (12 dynamic signs and 18 static signs) for experiments, performed classification using the RNN approach, and achieved an accuracy of 96.41%. Compared to the baseline approaches, the proposed approach outperforms all of them, with accuracy gains of 19.84%, 8.37%, 16.31%, 17.17%, 5.86%, and 3.26%, respectively.
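The reported gains are differences in percentage points between the proposed model's test accuracy of 99.67% and each baseline accuracy quoted above. A short check, with the baseline labels written here only for readability:

```python
PROPOSED_ACCURACY = 99.67  # DeepCNN accuracy on unseen test data (%)

# Baseline accuracies (%) in the order discussed in the text.
BASELINES = [
    ("SVM [30]", 79.83),
    ("SVM [31]", 91.30),
    ("SVM [34]", 83.36),
    ("Deep CNN [23]", 82.50),
    ("DNN [32]", 93.81),
    ("RNN [32]", 96.41),
]

gains = [round(PROPOSED_ACCURACY - acc, 2) for _, acc in BASELINES]
for (name, acc), gain in zip(BASELINES, gains):
    print(f"{name}: {acc:.2f}% -> gain of {gain:.2f} percentage points")
```

The six differences reproduce the listed gains, so the percentages are absolute improvements in accuracy rather than relative ones.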

Conclusion
Several researchers have tried to address the problem of hand gesture recognition in real life. It is challenging because of its demanding efficiency, robustness, and accuracy requirements. In this study, we proposed a robust ASL recognition approach covering the 24 alphabets that are used as static signs. The proposed approach is based on a deep convolutional neural network that recognizes the sign language alphabets. The proposed DeepCNN model can recognize the ASL alphabets with an accuracy of 99.67% on unseen test data. Initially, we used a single convolutional layer, which overfit the data. To handle this problem, we added two more convolutional layers, which improved the performance of the proposed algorithm. In the future, we plan to extend this work to real-time sign recognition using data provided by the leap motion controller. We also intend to recognize sign language gestures from video frames, which is a challenging task.

Data Availability
The dataset used in this work can be found at https://www.kaggle.com/datamunge/sign-language-mnist.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Table 2: Performance comparison of the proposed approach with the baseline approaches.